Preliminary Notes:

First, I’d like to comment that I will take a unique approach in this report. Since the actual interaction between stakeholders is limited to the information as stated in the Google Data Analytics Capstone Case Study Cours, I am limited in my ability to ask direct questions to the stakeholders for clarification on the purpose of the analysis as well as about the data. I had some questions about the ficticious bike company Cyclistic and the data, which is based on the bike share system in the city of Chicago.

Thus, I did some of my own research on Divvy Bike system in Chicago to help try to clarify my curiosities and concerns. Some of my comments for this report will be from the perspective of inside the role-play/case study, while other will be “meta” comments and will be indicated in bold.

Second, this is an amalgamation of two separate analysies. I found it more relevant to expand beyond the original constraints set out in the instructions, as I could not interact with the actually stakeholders. In the first one, I looked at data, grouped quarterly, starting from April 2019 to March 2020 (as the original r-script prompted to). During this initial analysis, I was prompted by many questions and turned to my own research. After further research on the Divvy website, I discovered some key information about the pricing system and types of bikes. I finished the initial analysis with more questions for the business task at hand.

My research found that the city introduced electronic bikes in around July/August of 2020, in which the purpose of that measure was to increase ridership. Also given that the Covid19 pandemic had just started at the beginning of 2020, I concluded that this might also impact ridership in a major way. I thought it relevant to include more data to capture the effects of these two key factors, in addition to some new bench marks I had established. In the second analysis, I looked at monthly data from June 2020 to May 2021. I then combined the two data sets and added more data (April/May 2020, June 2021) into one bigger analysis, for a total of 27 months of data.

Third, the original R-script was created by Kevin Hartman (Hartman) and was included in the case study assignment. For the sake of this case study, I will assume he is a member of my data analytics team.

Finally, the analysis and conclusions found here are the results of this my combined data analysies built upon the original R-script by Hartman.

Ask

The Company and Stakeholders:

The fictional Cyclistic bike-share program is based on the Divvy bike-share system in Chicago. The company is looking to increase the number of annual subscribers and it is my role to see how riders who have a membership to the system differ from those that don’t, and how the company can boast subscriber numbers through social media.

The following is detailed summary of the Case Study taken from the Capstone Assignment document, and modified slightly for brevity.

The company Cyclistic is a bike-share program with more than 5,800 bicycles and 600 docking stations. The bikes can be unlocked from one station and returned to any other station in the system anytime. The service sets itself apart from other services by offering various other bikes for people with disablities, but the majority of riders (92%) opt for traditional bikes. It has flexible pricing plans: single-ride or full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

The stakeholders are Lily Moreno, the director of marketing and my manager, the marketing analytics team, and the executive team. The executive team is detailed oriented and will make the final decisions. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program, including social media. My team and I are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

Marketing strategy had relied on building general awareness and appealing to broad consumer segments, and the pricing flexibility helped Cyclistic attract more customers. Now, Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders, and Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of and have chosen the program.

The goal is clear: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand the trends on how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

Three questions will guide the future marketing program: 1. How do annual members and casual riders use Cyclistic bikes differently? 2. Why would casual riders buy Cyclistic annual memberships? 3. How can Cyclistic use digital media to influence casual riders to become members?

My Goal and Business Task

As stated above in the case-study brief, the assignment clearly sets up the situation and my role in the analysis. I am working with my data analytics team, so for the sake of this role play, I am acting as if Kevin Hartman is a part of my team, and that he performed some initial analysis and wrote some R-code. Therefore, it is assumed that he helped me with this report. His contribution is duly noted here. Furthermore, I am to report directly to Lily Moreno and prepare my findings for the executive team.

With this in mind:

  • My goal is to make recommendations on how to convert casual users to annual users through the influence of digital media.

  • My business task is to find out the difference between how annual and casual members use the bikes in order to answer why a casual member would purchase an annual membership.

This is the starting point of my analysis.

Prepare

About the Data

The data in this analysis was downloaded directly from the Divvy website through AWS. It follows the principles of being reliable, original, comprehensive, current, and cited data. In other words, it ROCCCs!

It is made available by Motivate International Inc. under this license found here. Copies of the data are stored on my computer for use in my analysis. Proper Data Ethics are met with my compliance under this license. In short, I am using it for the sole purpose of analysis in the case study, and not for commercial purposes. In addition, I am not conducting data mining nor trying to correlate data with names or other information of customers. Lastly, I am in no way affiliated, approved, sponsered or endorsed by Motivate International Inc.

After a short struggle with the data and spreadsheets, it quickly became apparent that due to the large nature of the data, I would need to use R to perform my data analysis. Since the data is open and readily available to use, I downloaded and stored copies on my own computer for use with R Studio. I was ready for a quick look at the data.

About the Business Task

I had some initial questions at the back of my mind, and after a quick look at the initial data, I had more questions and I wanted to establish some metrics for the analysis.

Due to the fact that I cannot ask questions directly to any stakeholder, I decided to research the organization that the case study is based on https://www.divvybikes.com. I also found this report by the city of Chicago to be useful. Here are some key takeaways:

  • Single-rides cost $3.30 for 30 min-rides.
  • Full-day passes cost $15 for unlimited 3-hour rides within 24 hour period.
  • Annual membership costs $9/month ($108 billed up front) for unlimited 45-min rides.
  • D4E is a special pass at a reduced rate for people who qualify.
  • There are various overcharges per minute if your ride is longer than that listed above, and the cost depends on your type of membership and bike, but it is about $0.15 per minute.

Based on this cost structure, it would be beneficial to find out how often a non-member uses the system versus members. That is how many single ride trips they take in a year and/or how many day passes they purchase in a year.

Refinement of the Business Task

On a per month basis, the annual membership averages out to be equivalent to the cost of a little less than three single-ride trips per month. Time-wise, in three trips with an annual membership, you get a total 8,100 seconds (135 mins) and $0.90 savings compared with 5,400 seconds (90 mins) purchasing single-ride trips. At a total cost of $108, it is equivalent to 32.7 single-ride trips per year. Factoring in the unlimited 45-min rides within a month, it is far better to be a member than pay for single-ride trips if you are over than 32 ride threshold within a year, even if they have to pay upfront. Regular single-ride users should be purchasing a member pass, especially with D4E where you can get a membership fee at a reduced cost under certain conditions.

Comparing single rides with day passes, in time-wise measures, you get unlimited rides in three hour chunks that would cost $19.80 for one three hour trip in single rider fees, all for $15. Four shorter single ride trips in one day, or getting on and off the bike twice, would cost you $13.20. It is easy to see that anyone intending to stop-and-go more than twice in a day, or taking longer rides, would be better off buying a day pass.

For a casual customer who predominantly uses the system for this type of unlimited daily travel, an annual pass is equivalent to 7.2 day passes in a year. Even with a trade off of shorter ride times (Annual members can pay by the minute if they go over 45 mins), it seems reasonable to buy an annual membership if you are using day passes on the weekend or on other occasions a couple of times a month. In addition, for casual users who buy single rides around 30 times per year, or the equivalent of 5 times a month in half a year, it would be financially viable to purchase an annual membership. Any current customer or potential bike user who fits into either of these two categories just need to be urged to make the cost-benefit calculus of buying an annual membership up front.

One hesitation for buying an annual membership may be the climate. Many people in Chicago are probably discouraged to ride in the cold weather, so it is likely there are people who would only ride from May to October when the weather is favorable. They may view buying an annual membership as a waste of money when considering the fact that they wouldn’t use it for half of the year. But this is the wrong way to look at it.

For the success of the campaign, it is vital to market the casual single-riders who takes advantage of the bikes when the sun is out a couple of times per week, or day pass users who may only ride on weekends in the warmer weather. For customers who lie within the bounds of those benchmarks, even if only riding in the warmer months, the cost-benefit of an annual membership outweighs continuing to purchase non-member rides. If you are a casual member who mixes both buying day passes and single-rides every month in warmer times, or any time of the year for that matter, the trade off of an upfront membership versus unlimited 45 minute rides anytime should be an easy decision: even just taking rides on Saturdays or Sundays each month in the warmer months would be a better financial option in the long run. And, once having the pass in one’s possession, one would be more inclined to use it any time and thus creating a bigger windfall.

[In addition to marketing to day pass users, there would be much potential in marketing to users I think it is best to marker to those riders who fit into the above profiles.] If it is possible to get the details of repeat casual users and how they use the system, it would optimize the best way to market to those individuals. However, because of privacy concerns as stated in the case study assignment sheet “that data-privacy issues prohibit you from using riders’ personally identifiable information…to connect pass purchases…to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.”, it may not be possbile to gather this requisite information, and I will have to rely on the duration of trips and total number of trips, rather than the number of trips an individual user takes, to analyize the data. In a real life situation, I would inquire more to see if there was any way to obtain this information by consulting with the data team.

Therefore, in addition to the questions outlined earlier, some additional questions are:

  1. What is the average duration time for casual riders? Do they tend to be singe-ride users or day pass users?
  2. What percentage of users ride over 30 mins / 45 mins / 3 hours? When do riders tend to take longer trips?

Process

With these new goals and business task in mind, I was ready to process the data.

I started with the original R script by Hartman, available here at https://artscience.blog/home/divvy-dataviz-case-study, as the launching point for my analysis. All code is a modification and expansion of that original script. Some code and results are excluded or hidden for readability.

I used tidyverse, lubridate, ggplot2, and skimr (code excluded) for processing and analysis.

Process and Clean Data: Q2 2019 to Q1 2020

I read in the four data sets: Divvy_Trips_q2_2019.csv to Divvy_Trips_q1_2020.csv (code excluded)

I examined the structure of the data using the skimr and dyplr packages to get a detailed look at it.

glimpse(q1_2020)
glimpse(q4_2019)
glimpse(q3_2019)
glimpse(q2_2019)
skim_without_charts(q1_2020)
skim_without_charts(q4_2019)
skim_without_charts(q3_2019)
skim_without_charts(q2_2019)

The key findings for quarterly data are:

  1. The name of the columns are inconsistent between q2_2019, q1_2020 and the other two quarters.
  2. Bike ID are numeric for all 2019 data, but is non-numeric and called rideable_type.
  3. There is only one unique type of rideable bike: docked_bike.
  4. In the 2019 data, there are two user types, ‘subscriber’ and ‘customer’.
  5. In the 2020 data, the two user types are ‘member’ and ‘casual’, which is the standard that is mentioned in the initial documentation (see earlier).

All of the findings above and the modifications below were done in accordance with the analysis by Hartman.

I renamed all columns to be consistent with q1_2020 data and beyond.

I converted the type of ride_id and rideable_type to characters in order to bind the charts.

I created a data frame: all_trips_quarter

I removed start and end station ID and lat./long. data, as well as gender and birthday fields, which were only in the q2_2019 data.

(q4_2019 <- rename(q4_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid
                   ,started_at = start_time
                   ,ended_at = end_time
                   ,start_station_name = from_station_name
                   ,start_station_id = from_station_id
                   ,end_station_name = to_station_name
                   ,end_station_id = to_station_id
                   ,member_casual = usertype))

(q3_2019 <- rename(q3_2019
                   ,ride_id = trip_id
                   ,rideable_type = bikeid
                   ,started_at = start_time
                   ,ended_at = end_time
                   ,start_station_name = from_station_name
                   ,start_station_id = from_station_id
                   ,end_station_name = to_station_name
                   ,end_station_id = to_station_id
                   ,member_casual = usertype))

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID"
                   ,started_at = "01 - Rental Details Local Start Time"
                   ,ended_at = "01 - Rental Details Local End Time"
                   ,start_station_name = "03 - Rental Start Station Name"
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name"
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))

q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))

all_trips_quarter <- bind_rows(q2_2019, q3_2019, q4_2019, q1_2020)

all_trips_quarter <- all_trips_quarter %>%
  select(-c(start_station_id, end_station_id, start_lat, start_lng, end_lat, end_lng, birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped",
"05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
glimpse(all_trips_quarter)
skim_without_charts(all_trips_quarter)
table(all_trips_quarter$member_casual)

I transformed the customer and subscriber to member and casual to be consistent with q1_2020 data. A note on this nomenclature was mentioned by Hartman.

In my research, according to the report by the city of Chicago, mentioned earlier and found here, ebikes were introduced in July of 2020, and the data analysis following this first part will show this result as well. I therefore concluded that docked_bike here refers to what will later be classified as classic_bike (a pedal bike) and I decided to transform all values in rideable_type to docked_type from a bike_id number, to keep with this nomenclature. If this were not a role-play, I would confirm this assumption with a stakeholder.

I standardized the bike type to one: docked type.

table(all_trips_quarter$member_casual)
all_trips_quarter <-  all_trips_quarter %>%
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))
# Change all former bike_id to "docked_bike"
all_trips_quarter$rideable_type <- c("docked_bike")
which(all_trips_quarter$rideable_type != "docked_bike")

Here is a summary of the final data for all_trips_quarter to confirm this step in the completion of data cleaning.

glimpse(all_trips_quarter)
## Rows: 3,879,822
## Columns: 7
## $ ride_id            <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at         <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at           <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name   <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
unique(all_trips_quarter$member_casual)
## [1] "member" "casual"
unique(all_trips_quarter$rideable_type)
## [1] "docked_bike"

Process and Clean Data: April 2020 to June 2021

I read in the fifteen data sets: 202004-divvy-tripdata.csv to 202106-divvy-tripdata.csv (code excluded)

I examined the structure of the data to get a detailed look at it.

glimpse(april_2020)
glimpse(may_2020)
glimpse(june_2020)
glimpse(july_2020)
glimpse(aug_2020)
glimpse(sept_2020)
glimpse(oct_2020)
glimpse(nov_2020)
glimpse(dec_2020)
glimpse(jan_2021)
glimpse(feb_2021)
glimpse(march_2021)
glimpse(april_2021)
glimpse(may_2021)
glimpse(june_2021)
skim_without_charts(april_2020)
skim_without_charts(may_2020)
skim_without_charts(june_2020)
skim_without_charts(july_2020)
skim_without_charts(aug_2020)
skim_without_charts(sept_2020)
skim_without_charts(oct_2020)
skim_without_charts(nov_2020)
skim_without_charts(dec_2020)
skim_without_charts(jan_2021)
skim_without_charts(feb_2021)
skim_without_charts(march_2021)
skim_without_charts(april_2021)
skim_without_charts(may_2021)
skim_without_charts(june_2021)

The key findings for the monthly data are:

  1. Start and End Station ID before Dec 2020 are numeric, but are characters after.
  2. There are three unique bike types from Dec 2020: electric, classic and docked.
  3. There are two unique bike types July - Nov 2020: electric and docked.
  4. There is only one unique bike type from April - June 2020: docked.
  5. The number of total rows equals the number of unique ride IDs.

As mentioned in the previously mentioned report, “classic pedal bikes will continue to be parked at Divvy docking stations only [while] ebikes can be parked at Divvy docking stations, Divvy E-stations, or any public bike rack.” The report also mentions that “Divvy E-stations are clusters of bike racks.” In addition, I also examined the Divvy Bike Map and it indicates that there are stations with docks where there are no ebikes, and ebike exclusive stations where it says, “Parking for pedal bikes can be found at the docks icons on your map”, suggesting that docks refer to pedal bikes. Therefore, I will assume for the sake of this report that from December 2020 onward, when there are three types of bikes, that docked type continues to refer to a type of pedal bike. Still, considering that ebikes can be docked at any docking station, it is not 100% clear that this is the case. Please keep this in mind since I cannot determine absolute certainty due to the limitations of the case study and would confirm with a member of the Divvy Team in a real situation before proceeding.

Here are three summaries of the monthly data to show the differences in the data outlined above.

glimpse(april_2020)
## Rows: 84,776
## Columns: 13
## $ ride_id            <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at         <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at           <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ start_station_id   <dbl> 86, 503, 142, 216, 125, 173, 35, 434, 627, 377, 508…
## $ end_station_name   <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ end_station_id     <dbl> 152, 499, 255, 657, 323, 35, 635, 382, 359, 508, 37…
## $ start_lat          <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896…
## $ start_lng          <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -…
## $ end_lat            <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892…
## $ end_lng            <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -…
## $ member_casual      <chr> "member", "member", "member", "member", "casual", "…
glimpse(july_2020)
## Rows: 551,480
## Columns: 13
## $ ride_id            <chr> "762198876D69004D", "BEC9C9FBA0D4CF1B", "D2FD8EA432…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at         <dttm> 2020-07-09 15:22:02, 2020-07-24 23:56:30, 2020-07-…
## $ ended_at           <dttm> 2020-07-09 15:25:52, 2020-07-25 00:20:17, 2020-07-…
## $ start_station_name <chr> "Ritchie Ct & Banks St", "Halsted St & Roscoe St", …
## $ start_station_id   <dbl> 180, 299, 329, 181, 268, 635, 113, 211, 176, 31, 14…
## $ end_station_name   <chr> "Wells St & Evergreen Ave", "Broadway & Ridge Ave",…
## $ end_station_id     <dbl> 291, 461, 156, 94, 301, 289, 140, 31, 191, 142, 31,…
## $ start_lat          <dbl> 41.90687, 41.94367, 41.93259, 41.89076, 41.91172, 4…
## $ start_lng          <dbl> -87.62622, -87.64895, -87.63643, -87.63170, -87.626…
## $ end_lat            <dbl> 41.90672, 41.98404, 41.93650, 41.91831, 41.90799, 4…
## $ end_lng            <dbl> -87.63483, -87.66027, -87.64754, -87.63628, -87.631…
## $ member_casual      <chr> "member", "member", "casual", "casual", "member", "…
glimpse(dec_2020)
## Rows: 131,573
## Columns: 13
## $ ride_id            <chr> "70B6A9A437D4C30D", "158A465D4E74C54A", "5262016E0F…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at         <dttm> 2020-12-27 12:44:29, 2020-12-18 17:37:15, 2020-12-…
## $ ended_at           <dttm> 2020-12-27 12:55:06, 2020-12-18 17:44:19, 2020-12-…
## $ start_station_name <chr> "Aberdeen St & Jackson Blvd", NA, NA, NA, NA, NA, N…
## $ start_station_id   <chr> "13157", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_name   <chr> "Desplaines St & Kinzie St", NA, NA, NA, NA, NA, NA…
## $ end_station_id     <chr> "TA1306000003", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.87773, 41.93000, 41.91000, 41.92000, 41.80000, 4…
## $ start_lng          <dbl> -87.65479, -87.70000, -87.69000, -87.70000, -87.590…
## $ end_lat            <dbl> 41.88872, 41.91000, 41.93000, 41.91000, 41.80000, 4…
## $ end_lng            <dbl> -87.64445, -87.70000, -87.70000, -87.70000, -87.590…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
unique(april_2020$rideable_type)
## [1] "docked_bike"
unique(july_2020$rideable_type)
## [1] "docked_bike"   "electric_bike"
unique(dec_2020$rideable_type)
## [1] "classic_bike"  "electric_bike" "docked_bike"

I converted all pre-Dec 2020 station ID data to character type in order to be able to bind the data.

I created a data frame: all_trips_monthly

I removed start and end station ID and lat./long. data

april_2020 <-  mutate(april_2020, start_station_id = as.character(start_station_id)
                     , end_station_id = as.character(end_station_id))

may_2020 <-  mutate(may_2020, start_station_id = as.character(start_station_id)
                     , end_station_id = as.character(end_station_id))

june_2020 <-  mutate(june_2020, start_station_id = as.character(start_station_id)
                     , end_station_id = as.character(end_station_id))

july_2020 <-  mutate(july_2020, start_station_id = as.character(start_station_id)
                     , end_station_id = as.character(end_station_id))

aug_2020 <-  mutate(aug_2020, start_station_id = as.character(start_station_id)
                    , end_station_id = as.character(end_station_id))

sept_2020 <-  mutate(sept_2020, start_station_id = as.character(start_station_id)
                     , end_station_id = as.character(end_station_id))

oct_2020 <-  mutate(oct_2020, start_station_id = as.character(start_station_id)
                    , end_station_id = as.character(end_station_id))

nov_2020 <-  mutate(nov_2020, start_station_id = as.character(start_station_id)
                    , end_station_id = as.character(end_station_id))

all_trips_monthly <- bind_rows(april_2020, may_2020, june_2020, july_2020,
                               aug_2020, sept_2020, oct_2020, nov_2020,
                               dec_2020, jan_2021, feb_2021, march_2021,
                               april_2021, may_2021, june_2021)

all_trips_monthly <- all_trips_monthly %>%
  select(-c(start_station_id, end_station_id, start_lat, start_lng, end_lat, end_lng))

glimpse(all_trips_monthly)
skim_without_charts(all_trips_monthly)

Here is a summary of the final data for all_trips_monthly to confirm this step in the completion of data cleaning.

Note: There is a slight difference, 209, between the number of rows and the number of unique ride_ids. The numbers were unique and matched for each month, so I am not concerned about this small number affecting the results.

glimpse(all_trips_monthly)
## Rows: 5,088,206
## Columns: 7
## $ ride_id            <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at         <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-…
## $ ended_at           <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-…
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu…
## $ end_station_name   <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "…
## $ member_casual      <chr> "member", "member", "member", "member", "casual", "…
nrow(all_trips_monthly)
## [1] 5088206
length(unique(all_trips_monthly$ride_id))
## [1] 5087997

I checked if the column name’s for quarterly and monthly data matched in both data frames, and then bound them together.

I created a combined data frame: all_trips

colnames(all_trips_quarter)
colnames(all_trips_monthly)

all_trips <- bind_rows(all_trips_quarter, all_trips_monthly)
skim_without_charts(all_trips)

Here is a quick confirmation of the combined data frame.

glimpse(all_trips)
## Rows: 8,968,028
## Columns: 7
## $ ride_id            <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at         <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at           <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name   <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

I added columns that list the date, month, day, and year of each ride and checked the data. As Hartman noted in his report (R-script), the original data “at the ride level..is too granular.”

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%a")
glimpse(all_trips)

Trip duration was dropped from the original data in q1_2020, and also I removed that data column for 2019 earlier.

I recalculated trip duration and created a new metric in a column called ride_length that shows the duration of each trip.

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
skim_without_charts(all_trips$ride_length)

I converted the data to numeric type in order to perform calculations on the data.

is.factor(all_trips$ride_length)
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
glimpse(all_trips$ride_length)
skim_without_charts(all_trips)

The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative, as stated by Hartman.

I confirmed this by examining the data to find negative durations and a station named ‘HQ QR’, as seen below.

all_trips %>%
  count(ride_length < 0)
## # A tibble: 2 x 2
##   `ride_length < 0`       n
##   <lgl>               <int>
## 1 FALSE             8957334
## 2 TRUE                10694
all_trips %>%
  count(start_station_name == "HQ QR")
## # A tibble: 3 x 2
##   `start_station_name == "HQ QR"`       n
##   <lgl>                             <int>
## 1 FALSE                           8682193
## 2 TRUE                               3767
## 3 NA                               282068

I created a new version of the data frame (v2) since data is being removed, as per Hartman’s advice. A quick confirmation and summary is included.

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
glimpse(all_trips_v2)
## Rows: 8,953,684
## Columns: 13
## $ ride_id            <chr> "22178529", "22178530", "22178531", "22178532", "22…
## $ started_at         <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-…
## $ ended_at           <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-…
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ start_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSal…
## $ end_station_name   <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevel…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2019-04-01, 2019-04-01, 2019-04-01, 2019-04-01, 20…
## $ month              <chr> "04", "04", "04", "04", "04", "04", "04", "04", "04…
## $ day                <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ year               <chr> "2019", "2019", "2019", "2019", "2019", "2019", "20…
## $ day_of_week        <chr> "Mon", "Mon", "Mon", "Mon", "Mon", "Mon", "Mon", "M…
## $ ride_length        <dbl> 446, 1048, 252, 357, 1007, 257, 548, 383, 2137, 212…

The data is now ready for analysis.

Analyze

My main focus was to look at how member and casual riders differ by examining two main measures: duration and bike type. I wanted to how long riders used the bikes in accordance with my benchmarks, and I wanted to see if the introduction of ebikes increased ridership

Brief Review of Analysis One and Two

In the original analysis following Hartman (the original quarterly data set), I calculated some basic statistics for duration (ride_length), which was followed by aggregating duration by user type (member_casual). These stats were then proceeded by further aggregation by day of week with some accompanying charts. I did the same process when analyzing the monthly data in the second analysis, where I also made some additional visualizations. In order to get a better picture, I aggregated by bike type (rideable_type) and year, by user type and month, by user type and month and year, and by bike type and month from June 2020 to the end of the data set.

The code below shows the original R-script with Hartman’s comments plus my additional code. It is for the combined data set and is here to demonstrate my thinking process. The results and visualizations are not shown.

summary(all_trips_v2$ride_length)
# Compare members and casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

# See the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

# Notice that the days of the week are out of order. Let's fix that.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

# Now, let's run the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

# analyze ridership data by type and weekday
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(member_casual, weekday) %>%  #groups by usertype and weekday
  summarise(number_of_rides = n()                           #calculates the number of rides and average duration
            ,average_duration = mean(ride_length)) %>%      # calculates the average duration
  arrange(member_casual, weekday)                               # sorts

# Let's visualize the number of rides by rider type
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday)  %>%
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

# Let's visualize average duration by rider type
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday)  %>%
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

#-------------------------------------------------------------
# From here is my code - thanks to Hartman for the assistance
#-------------------------------------------------------------

# A visualization showing proportion of member/casual by bike type and year
all_trips_v2 %>%
  drop_na(member_casual, rideable_type) %>%
  group_by(member_casual, rideable_type) %>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual)) +
  facet_wrap(~year)

# A visualization showing proportion of member/casual by month
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%
  ggplot(aes(x = month)) +
  geom_bar(aes(fill = member_casual))

# A visualization showing proportion of member/casual by month and year
all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%
  ggplot(aes(x = month)) +
  geom_bar(aes(fill = member_casual)) +
  facet_wrap(~year)

# N.B. some of the aggregations mentioned above are not shown here and can be seen later sections

# A visualization showing average duration by month and year, member/casual
all_trips_v2 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, year, month) %>%   
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~year)

# Visualizations of proportion of members/casual by type of bike for each month in 2020

# For 2020, I used the monthly tables
june_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

july_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

aug_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

sept_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

oct_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

nov_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

dec_2020%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

# For 2021,I filtered the all_trips table
all_trips_v2%>%
  filter(month == "01")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

all_trips_v2%>%
  filter(month == "02")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

all_trips_v2%>%
  filter(month == "03")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

all_trips_v2%>%
  filter(month == "04")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

all_trips_v2%>%
  filter(month == "05")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

all_trips_v2%>%
  filter(month == "06")%>%
  group_by(member_casual, rideable_type)%>%
  ggplot(aes(x = rideable_type)) +
  geom_bar(aes(fill = member_casual))

I was planning to explore this second data set more. However, it was at this point that I decided that I wanted to combine the data sets and fill in the gaps in order to get a complete picture of the effects of Covid and ebikes. One main reason was because I noticed a proportional decrease in casual members as summer turned to winter, but a proportional increase in 2021 as winter gave way to spring. I wanted to see the changes across a time span of more than one year, so I created a data set of two years and one quarter.

The New Data Set: Part 1

In the combined analysis, I retraced my steps repeating the same cleaning and analyzing to check the data (see the code above), and then I jumped straight to a look at duration and rider count by month.

Looking at the visualization for monthly stats, I noticed some unusual high durations for January and February, as seen in the chart below. This is also very noticeable when separated into year (chart not shown) This was puzzling because in winter months, ride time would normally be lower given the cold. After noticing this, I took a side-step to explore the source of this out-of-place result. The yearly chart (see below in the next section) pointed towards the beginning of 2020 as the place to look.

all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n()   
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, month) %>% 
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

A Monthly Deep Dive: More Data Processing and Cleaning

In the data aggregated by year, month and member_casual, you can clearly see some very high averages at the beginning of 2020 which is distorting average duration.

#Stats by year/month
all_trips_v2 %>%
  drop_na(member_casual, month, year) %>%
  group_by(member_casual, month, year) %>%  
  summarise(number_of_rides = n()       
            ,average_duration = mean(ride_length)) %>%  
  arrange(member_casual, month, year) 
## # A tibble: 54 x 5
## # Groups:   member_casual, month [24]
##    member_casual month year  number_of_rides average_duration
##    <chr>         <chr> <chr>           <int>            <dbl>
##  1 casual        01    2020             7785            9699.
##  2 casual        01    2021            15916            1628.
##  3 casual        02    2020            12314            7997.
##  4 casual        02    2021             9219            3160.
##  5 casual        03    2020            24615            4250.
##  6 casual        03    2021            78867            2362.
##  7 casual        04    2019            47744            3057.
##  8 casual        04    2020            23610            4388.
##  9 casual        04    2021           126181            2370.
## 10 casual        05    2019            81624            3074.
## # … with 44 more rows

Looking at the maximum values of ride_length in more detail below, for all data from the monthly data set (April 2020 and onward), there are some max values around 3,000,000 (3M), but none higher than about 3.5M. From the quarterly data, we see all values over 2M, with a max duration as high as around 9M- three times higher than the highest in the other data set. For perspective, that is equal to about 34 days versus 104 days. It seems odd that there would be a bike trip lasting 3 months, let alone 1 month!

Even though there are some high max values in other months, they only seem to have an adverse effect on the winter months and casual users. This is not logical given the winter conditions of a city in the north next to Lake Michigan in addition to the stats that are lower on all major measures and months. I started to suspect that there was something going on.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$month + all_trips_v2$year, FUN = max)
##    all_trips_v2$month all_trips_v2$year all_trips_v2$ride_length
## 1                  04              2019                  4757638
## 2                  05              2019                  4519226
## 3                  06              2019                  2348763
## 4                  07              2019                  9056634
## 5                  08              2019                  7667470
## 6                  09              2019                  4563322
## 7                  10              2019                  8582302
## 8                  11              2019                  3307285
## 9                  12              2019                  3246842
## 10                 01              2020                  9387024
## 11                 02              2020                  8636205
## 12                 03              2020                  5627611
## 13                 04              2020                  3523202
## 14                 05              2020                  1733813
## 15                 06              2020                  2476260
## 16                 07              2020                  2997927
## 17                 08              2020                  2450782
## 18                 09              2020                  3257001
## 19                 10              2020                  2143463
## 20                 11              2020                  2156040
## 21                 12              2020                   584459
## 22                 01              2021                  1189555
## 23                 02              2021                  1807754
## 24                 03              2021                  1900899
## 25                 04              2021                  2866602
## 26                 05              2021                  3235296
## 27                 06              2021                  3356649
# aggregate(all_trips_v2$ride_length ~ all_trips_v2$month + all_trips_v2$year, FUN = mean)

From here, I honed in on the first three months where the duration averages were unusual, suspecting that some outlier data points could be the source of the distortion. I compared some key stats for the first three months of 2020 and 2021 as seen in the chart below. Where there were some very high data points, the median and mean were very far apart from each other. This suggests that there are enough of these outlier data points to distort the average. I investigated further by looking at the density of the three months across the two years.

Comparing the scatter plot for Q1 of 2020 with random noise with one for Q1 of 2021, they clearly show a lot of distortion coming from casual members in 2020. Note that in 2021, there is no distortion in the member side, and all data points are under 2M, with most outliers under 500K on the casual side. This is not the case in 2020, where there are quite a lot of data points over 500K comparatively speaking. This reinforced my suspicions that outlier data points were not actual trips and were affecting the data in negative way.

all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(count = n(), median = median(ride_length),
            mean = mean(ride_length), max = max(ride_length)) %>%
  filter(month == "01" | month == "02" | month == "03" ) %>%
  arrange(member_casual, month, year)
## # A tibble: 12 x 7
## # Groups:   member_casual, month [6]
##    member_casual month year   count median  mean     max
##    <chr>         <chr> <chr>  <int>  <dbl> <dbl>   <dbl>
##  1 casual        01    2020    7785   1028 9699. 9387024
##  2 casual        01    2021   15916    747 1628. 1189555
##  3 casual        02    2020   12314   1297 7997. 8636205
##  4 casual        02    2021    9219    998 3160. 1807754
##  5 casual        03    2020   24615   1549 4250. 3435010
##  6 casual        03    2021   78867   1154 2362. 1900899
##  7 member        01    2020  136099    493  669.  625452
##  8 member        01    2021   72291    526  778.   89997
##  9 member        02    2020  126714    494  768. 4753051
## 10 member        02    2021   36357    614 1105.   89997
## 11 member        03    2020  115593    570  860. 5627611
## 12 member        03    2021  134779    602  839.   93596
q1_2020$ride_length <- difftime(q1_2020$ended_at,q1_2020$started_at)
q1_2020$ride_length <- as.numeric(as.character(q1_2020$ride_length))
q1_2020$date <- as.Date(q1_2020$started_at) #The default format is yyyy-mm-dd
q1_2020$month <- format(as.Date(q1_2020$date), "%m")

q1_2020 %>%
  group_by(month, ride_length) %>%
  ggplot(aes(x = month, y = ride_length, colour = member_casual)) +
  geom_jitter() +
  facet_wrap(~member_casual) + 
  ggtitle("Jan to Mar 2020")

all_trips_v2 %>%
  filter(month == "01" & year == "2021"  | month == "02" & year == "2021"  | month == "03" & year == "2021" ) %>%
  ggplot(aes(x = month, y = ride_length, colour = member_casual)) +
  geom_jitter() +
  facet_wrap(~member_casual) +
  ggtitle("Jan to Mar 2021")

With all majority of the high data points coming from casual users and having dropped the maintenance stations from the data, I concluded that there must be some anomaly in the data, like thefts, lost bikes, bikes not docked properly, a glitch in the software tracking the bikes, or something else causing the high duration times that are not intentional by casual riders.

In this situation in a real scenario, I would inquire further with the data team, the data engineers, or someone in charge of the collection of data about these high data point and their possible cause. For the purpose of this analysis, I will continue on with my assumption.

From my research on the bike share system, day trip passes are valid for 24 hours and allow 3 hour rides before docking is necessary to avoid extra payment. Under the assumption that the high values distorting the data are not actual rides, I decided to see how removing data over these two bench marks affected the key measures, mainly average duration.

I made three new data frame that contains the monthly data aggregated by year and user type, called monthly_user_stats, monthly_user_stats_3hr, monthly_user_stats_24hr. I also calculated how much data is lost when removing rides that exceeded those duration benchmarks. As you can see below, the number of rides over 3 hours is only a little over half a percent, while the number of rides over 24 hours is negligible.

# Monthly table for all trips with duration greater than 0
monthly_user_stats <- all_trips_v2 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length)
            , average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)

# Create a table for trips under 10800 seconds/3 hrs
monthly_user_stats_3hrs <- all_trips_v2 %>%
  filter(ride_length <= 10800) %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length)
            , average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)

m <- nrow(filter(all_trips_v2, !(ride_length <= 10800)))
n <- nrow(all_trips_v2)
lossage <- m/n*100
capture.output(cat("The amount of data lost is if only include rides up to 3 hours:",lossage, "%"))
## [1] "The amount of data lost is if only include rides up to 3 hours: 0.5613332 %"
# Create a table for trips under 86400 seconds/24 hrs
monthly_user_stats_24hrs <- all_trips_v2 %>%
  filter(ride_length <= 86400) %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month, year) %>%
  summarise(total = n(), median_ride = median(ride_length), max_ride = max(ride_length)
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, month, year) %>%
  rename(user = member_casual)

m <- nrow(filter(all_trips_v2, !(ride_length <= 86400)))
lossage <- m/n*100
capture.output(cat("The amount of data lost is if only include rides up to 24 hours:",lossage,"%"))
## [1] "The amount of data lost is if only include rides up to 24 hours: 0.06914472 %"

I then wanted to see how strongly these large ride times were impacting the ride length data. I created two monthly comparison charts that displayed the type of user, the month and year, the number of rides and the average duration of the whole data set compared to the reduced data set at 3 hrs and 24 hrs.

# 3 hour monthly data frame comparison
monthly_stat_comparison_3hr <- bind_cols(monthly_user_stats, monthly_user_stats_3hrs)
monthly_stat_comparison_3hr <- monthly_stat_comparison_3hr %>%
  select(c(1,2,3,4,7,11,14)) %>%
  rename("user"=1, "mo."=2, "yr."=3, "rides"=4, "avg_dur"=5, "rides_3hr"=6, "avg_dur_3hr"=7)
monthly_stat_comparison_3hr$diff_rides <- (monthly_stat_comparison_3hr$rides - monthly_stat_comparison_3hr$rides_3hr)
monthly_stat_comparison_3hr$diff_dur <- (monthly_stat_comparison_3hr$avg_dur - monthly_stat_comparison_3hr$avg_dur_3hr)

# 24 hour monthly data frame comparison
monthly_stat_comparison_24hr <- bind_cols(monthly_user_stats, monthly_user_stats_24hrs)
monthly_stat_comparison_24hr <- monthly_stat_comparison_24hr %>%
  select(c(1,2,3,4,7,11,14)) %>%
  rename("user"=1, "mo."=2, "yr."=3, "rides"=4, "avg_dur"=5, "rides_24hr"=6, "avg_dur_24hr"=7)
monthly_stat_comparison_24hr$diff_rides <- (monthly_stat_comparison_24hr$rides - monthly_stat_comparison_24hr$rides_24hr)
monthly_stat_comparison_24hr$diff_dur <- (monthly_stat_comparison_24hr$avg_dur - monthly_stat_comparison_24hr$avg_dur_24hr)

print(monthly_stat_comparison_3hr, n = 54)
## # A tibble: 54 x 9
##    user   mo.   yr.    rides avg_dur rides_3hr avg_dur_3hr diff_rides diff_dur
##    <chr>  <chr> <chr>  <int>   <dbl>     <int>       <dbl>      <int>    <dbl>
##  1 casual 01    2020    7785   9699.      7504       1452.        281   8247. 
##  2 casual 01    2021   15916   1628.     15798       1157.        118    471. 
##  3 casual 02    2020   12314   7997.     11991       1759.        323   6238. 
##  4 casual 02    2021    9219   3160.      9038       1509.        181   1651. 
##  5 casual 03    2020   24615   4250.     24136       2048.        479   2202. 
##  6 casual 03    2021   78867   2362.     77980       1702.        887    660. 
##  7 casual 04    2019   47744   3057.     47278       2257.        466    799. 
##  8 casual 04    2020   23610   4388.     23198       2031.        412   2356. 
##  9 casual 04    2021  126181   2370.    124708       1691.       1473    679. 
## 10 casual 05    2019   81624   3074.     80617       2267.       1007    807. 
## 11 casual 05    2020   86844   3073.     84781       2222.       2063    851. 
## 12 casual 05    2021  231178   2416.    228255       1734.       2923    682. 
## 13 casual 06    2019  130218   2755.    128625       2195.       1593    560. 
## 14 casual 06    2020  154551   3100.    151063       2115.       3488    985. 
## 15 casual 06    2021  327831   2361.    323835       1605.       3996    756. 
## 16 casual 07    2019  175632   3587.    173360       2219.       2272   1368. 
## 17 casual 07    2020  268606   3598.    261867       2215.       6739   1383. 
## 18 casual 08    2019  186889   4020.    184345       2168.       2544   1852. 
## 19 casual 08    2020  285379   2708.    280653       1959.       4726    749. 
## 20 casual 09    2019  129173   3100.    127588       2028.       1585   1072. 
## 21 casual 09    2020  221367   2330.    218867       1743.       2500    586. 
## 22 casual 10    2019   71035   3540.     70002       1827.       1033   1712. 
## 23 casual 10    2020  131259   1886.    130180       1503.       1079    383. 
## 24 casual 11    2019   18723   4022.     18322       1513.        401   2509. 
## 25 casual 11    2020   78556   2004.     77914       1577.        642    427. 
## 26 casual 12    2019   16430   3800.     16098       1698.        332   2102. 
## 27 casual 12    2020   26449   1699.     26257       1274.        192    425. 
## 28 member 01    2020  136099    669.    136005        624.         94     44.7
## 29 member 01    2021   72291    778.     72142        717.        149     61.9
## 30 member 02    2020  126714    768.    126631        637.         83    132. 
## 31 member 02    2021   36357   1105.     36131        829.        226    276. 
## 32 member 03    2020  115593    860.    115483        751.        110    110. 
## 33 member 03    2021  134779    839.    134639        807.        140     32.2
## 34 member 04    2019  217566    811.    217386        727.        180     83.3
## 35 member 04    2020   61115   1289.     60991       1029.        124    260. 
## 36 member 04    2021  184989    882.    184733        838.        256     43.8
## 37 member 05    2019  285834    831.    285556        769.        278     61.7
## 38 member 05    2020  113258   1186.    113028       1117.        230     69.4
## 39 member 05    2021  246710    885.    246378        843.        332     42.0
## 40 member 06    2019  345177    873.    344793        814.        384     58.9
## 41 member 06    2020  187985   1124.    187648       1055.        337     68.8
## 42 member 06    2021  321667    882.    321209        834.        458     48.0
## 43 member 07    2019  381683    986.    381325        838.        358    148. 
## 44 member 07    2020  280980   1066.    280512       1005.        468     61.3
## 45 member 08    2019  403295    971.    402999        814.        296    157. 
## 46 member 08    2020  326618   1010.    326216        949.        402     61.3
## 47 member 09    2019  364046    848.    363826        775.        220     72.6
## 48 member 09    2020  289768    933.    289400        877.        368     56.8
## 49 member 10    2019  300751    782.    300563        703.        188     79.0
## 50 member 10    2020  224285    851.    223941        797.        344     54.4
## 51 member 11    2019  158440    746.    158300        644.        140    102. 
## 52 member 11    2020  155976    824.    155832        796.        144     28.3
## 53 member 12    2019  138662    685.    138555        640.        107     44.7
## 54 member 12    2020   93033    770.     92924        731.        109     38.6
print(monthly_stat_comparison_24hr, n = 54)
## # A tibble: 54 x 9
##    user   mo.   yr.    rides avg_dur rides_24hr avg_dur_24hr diff_rides diff_dur
##    <chr>  <chr> <chr>  <int>   <dbl>      <int>        <dbl>      <int>    <dbl>
##  1 casual 01    2020    7785   9699.       7721        2261.         64  7438.  
##  2 casual 01    2021   15916   1628.      15889        1334.         27   293.  
##  3 casual 02    2020   12314   7997.      12251        2355.         63  5642.  
##  4 casual 02    2021    9219   3160.       9161        1905.         58  1255.  
##  5 casual 03    2020   24615   4250.      24515        2464.        100  1786.  
##  6 casual 03    2021   78867   2362.      78715        1932.        152   430.  
##  7 casual 04    2019   47744   3057.      47669        2452.         75   605.  
##  8 casual 04    2020   23610   4388.      23507        2366.        103  2022.  
##  9 casual 04    2021  126181   2370.     125950        1921.        231   449.  
## 10 casual 05    2019   81624   3074.      81507        2495.        117   579.  
## 11 casual 05    2020   86844   3073.      86666        2669.        178   404.  
## 12 casual 05    2021  231178   2416.     230770        1976.        408   440.  
## 13 casual 06    2019  130218   2755.     130066        2425.        152   330.  
## 14 casual 06    2020  154551   3100.     154216        2543.        335   557.  
## 15 casual 06    2021  327831   2361.     327230        1834.        601   526.  
## 16 casual 07    2019  175632   3587.     175433        2437.        199  1150.  
## 17 casual 07    2020  268606   3598.     267939        2678.        667   919.  
## 18 casual 08    2019  186889   4020.     186625        2409.        264  1611.  
## 19 casual 08    2020  285379   2708.     284923        2260.        456   449.  
## 20 casual 09    2019  129173   3100.     128987        2268.        186   832.  
## 21 casual 09    2020  221367   2330.     221095        1947.        272   383.  
## 22 casual 10    2019   71035   3540.      70889        2129.        146  1410.  
## 23 casual 10    2020  131259   1886.     131098        1670.        161   216.  
## 24 casual 11    2019   18723   4022.      18647        2058.         76  1964.  
## 25 casual 11    2020   78556   2004.      78465        1735.         91   269.  
## 26 casual 12    2019   16430   3800.      16365        2231.         65  1569.  
## 27 casual 12    2020   26449   1699.      26408        1422.         41   278.  
## 28 member 01    2020  136099    669.     136082         646.         17    22.5 
## 29 member 01    2021   72291    778.      72281         766.         10    12.3 
## 30 member 02    2020  126714    768.     126695         659.         19   110.  
## 31 member 02    2021   36357   1105.      36298         960.         59   144.  
## 32 member 03    2020  115593    860.     115566         776.         27    84.5 
## 33 member 03    2021  134779    839.     134769         833.         10     6.65
## 34 member 04    2019  217566    811.     217531         748.         35    62.6 
## 35 member 04    2020   61115   1289.      61095        1082.         20   206.  
## 36 member 04    2021  184989    882.     184965         871.         24    11.5 
## 37 member 05    2019  285834    831.     285793         794.         41    36.9 
## 38 member 05    2020  113258   1186.     113236        1160.         22    25.9 
## 39 member 05    2021  246710    885.     246676         873.         34    12.2 
## 40 member 06    2019  345177    873.     345135         843.         42    30.2 
## 41 member 06    2020  187985   1124.     187963        1098.         22    26.0 
## 42 member 06    2021  321667    882.     321603         864.         64    17.7 
## 43 member 07    2019  381683    986.     381615         861.         68   125.  
## 44 member 07    2020  280980   1066.     280936        1042.         44    24.3 
## 45 member 08    2019  403295    971.     403241         832.         54   139.  
## 46 member 08    2020  326618   1010.     326579         976.         39    34.3 
## 47 member 09    2019  364046    848.     364004         792.         42    56.3 
## 48 member 09    2020  289768    933.     289729         901.         39    32.8 
## 49 member 10    2019  300751    782.     300717         720.         34    61.8 
## 50 member 10    2020  224285    851.     224241         830.         44    21.6 
## 51 member 11    2019  158440    746.     158394         666.         46    79.9 
## 52 member 11    2020  155976    824.     155956         813.         20    11.4 
## 53 member 12    2019  138662    685.     138647         663.         15    22.3 
## 54 member 12    2020   93033    770.      93021         758.         12    11.5
compare_3hrs_to_24_hrs_duration <- c(monthly_stat_comparison_24hr$diff_dur - monthly_stat_comparison_3hr$diff_dur)

compare_3hrs_to_24_hrs_duration
##  [1] -808.95984 -177.41104 -595.82557 -395.74891 -416.47501 -229.99131
##  [7] -194.48985 -334.35080 -230.05472 -228.01917 -447.14234 -242.40478
## [13] -230.67174 -427.97375 -229.75620 -218.39610 -463.71704 -240.71106
## [19] -300.62578 -240.70977 -203.62176 -302.31227 -166.68854 -545.26135
## [25] -158.05707 -532.52753 -147.29640  -22.13153  -49.60210  -22.28224
## [31] -131.29384  -25.03519  -25.52783  -20.74420  -53.46343  -32.29837
## [37]  -24.83281  -43.48702  -29.79099  -28.66920  -42.85319  -30.24563
## [43]  -23.05305  -36.96603  -18.19795  -26.96707  -16.30551  -24.03485
## [49]  -17.19730  -32.75682  -22.01300  -16.86407  -22.38140  -27.10596

On the whole, the average duration changes very little as seen in the charts above. Looking at the 3 hour trip ride data for members, all changes in average are under 5 minutes (see the column titled diff_dur). For casual users, the change ranges from around a 7 to 40-minute difference, with the exception being two huge differences for Jan and Feb 2020, as the previous Jan to Mar 2020 scatter plot suggested. Almost all larger changes occur in the quarterly data set before April 2020. In addition, the difference in ride numbers range from 83 to 6739, with all member ride number differences in the hundreds. Again, this shows that only a very small percentage of the total riders who are taking trips over 3 hrs, and the vast majority are from casual users.

Taking into consideration the 24 hr data, we see the same general trends in the differences. Narrowing the data has almost no effect on the average member duration. A difference of the differences display a range from a max of less than 15 minutes to a min of 16 seconds. Again, the biggest differences are seen in the Q1 2020. This means that aggregating the data from 24 hour trips to 3 hour trips does not show a major change in the average duration length. And, the difference in the number of rides for trips under 24 hours compared to the whole vary from 10 to 667. There are so few rides over 24 hours that the ones that are are distorting the data and not accurately representing how users use the bikes. This suggests to me that something changed in how the bikes were managed or dealt with, or how the data was managed or collected. Again, I would inquire more into this situation if it were not a role-play. Eliminating the noise in the long tails would give us a better picture of how users, both member and casual, use the bike share service.

This noise can be confirmed with some density diagrams below. For 24 hr trips, the data still has a huge right-skew. There is a 99% chance that a casual member (bandwidth of about 60 seconds) uses a bike for about 3 hours and 20 mins or less; the 99-percentile for members (bandwidth of about 20 seconds) is just over 45 mins. Comparing the 24hr chart to the 3hr chart, there still is a positive-skew, and for members the 99 percentile is virtually the same. For casual users, 95% of users ride a bike for around 100 minutes or less in both cases.

# Data Frames that filter out trip rides over 24 hours
member_24 <- all_trips_v2 %>%
  filter(member_casual == "member" & ride_length < 86400) %>%
  drop_na(ride_length)

casual_24 <- all_trips_v2 %>%
  filter(member_casual == "casual" & ride_length < 86400) %>%
  drop_na(ride_length)

# Density Plots and Key Percentiles for 24 hrs
member_24 %>%
  ggplot( aes(x=ride_length)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Density of Members: Rides up to 24 Hrs")

#density_member_24 <- density(member_24$ride_length)
#plot(density_member_24, main = "Density of Members: Rides up to 24 Hrs")
"Quantiles for Members with Ride Length under 24-Hours"
## [1] "Quantiles for Members with Ride Length under 24-Hours"
quantile(member_24$ride_length, c(.5, .75, .9, .95, .99))
##  50%  75%  90%  95%  99% 
##  628 1065 1662 2084 3092
casual_24 %>%
  ggplot( aes(x=ride_length)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Density of Casual Riders: Rides up to 24 Hrs")

#density_casual_24 <- density(casual_24$ride_length)
#plot(density_casual_24, main = "Density of Casual Riders: Rides up to 24 Hrs")
"Quantiles for Casual Riders with Ride Length under 24-Hours"
## [1] "Quantiles for Casual Riders with Ride Length under 24-Hours"
quantile(casual_24$ride_length, c(.5, .75, .9, .95, .99))
##   50%   75%   90%   95%   99% 
##  1331  2434  4610  6542 11939
# Data Frames that filter out trip rides over 3 hours
member_3 <- all_trips_v2 %>%
  filter(member_casual == "member" & ride_length < 10800) %>%
  drop_na(ride_length)

casual_3 <- all_trips_v2 %>%
  filter(member_casual == "casual" & ride_length < 10800) %>%
  drop_na(ride_length)

# Density Plots and Key Percentiles for 3 hrs

member_3 %>%
  ggplot( aes(x=ride_length)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Density of Members: Rides up to 3 Hrs")

#density_member_3 <- density(member_3$ride_length)
#plot(density_member_3, main = "Density of Members: Rides up to 3 Hrs")

"Quantiles for Members with Ride Length under 3-Hours"
## [1] "Quantiles for Members with Ride Length under 3-Hours"
quantile(member_3$ride_length, c(.5, .75, .9, .95, .99))
##  50%  75%  90%  95%  99% 
##  627 1063 1656 2074 2998
casual_3 %>%
  ggplot( aes(x=ride_length)) +
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Density of Casual Riders: Rides up to 3 Hrs")

#density_casual_3 <- density(casual_3$ride_length)
#plot(density_casual_3, main = "Density of Casual Riders: Rides up to 3 Hrs")

"Quantiles for Casual Riders with Ride Length under 3-Hours"
## [1] "Quantiles for Casual Riders with Ride Length under 3-Hours"
quantile(casual_3$ride_length, c(.5, .75, .9, .95, .99))
##  50%  75%  90%  95%  99% 
## 1313 2356 4324 5905 8959

With such a big skew of the distribution when considering all the recorded trips, I concluded that reducing the data set would give me a more accurate picture of how member and casual users differ. Only about 1 in 180 rides were over 3 hours while 1 in 1450 rides were over 24 hours, thus I decided to limit my data to rides under 3 hours.

The New Data Set: Part 2

For the sake of bin-width and to allow for usage a little over the 3-hour benchmark, I decided to limit my data set to 12,900 seconds, or 3 hours and 35 minutes. I examined the refined data set organized by member-casual and three other aggregations: per month, per day of the week, and per bike type. First, I looked at some basic summary statistics of the refined data set, and then explored ride length and number of rides in more detail. Here are the summary charts for aggregation by month and by day of the week.

all_trips_v3 <- all_trips_v2[!(all_trips_v2$ride_length > 12900 ),]

all_trips_v3 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n()   
            , average_duration = mean(ride_length), median_duration = median(ride_length)) %>%
  arrange(member_casual, month) %>%
  rename(User = member_casual, Month = month) %>%
  print(n = Inf)
## # A tibble: 24 x 5
## # Groups:   User [2]
##    User   Month number_of_rides average_duration median_duration
##    <chr>  <chr>           <int>            <dbl>           <dbl>
##  1 casual 01              23350            1273.             818
##  2 casual 02              21099            1685.            1136
##  3 casual 03             102510            1822.            1242
##  4 casual 04             195884            1903.            1307
##  5 casual 05             395608            1996.            1363
##  6 casual 06             606281            1903.            1273
##  7 casual 07             438118            2279.            1546
##  8 casual 08             467493            2094.            1423
##  9 casual 09             347784            1886.            1274
## 10 casual 10             200764            1646.            1095
## 11 casual 11              96504            1593.            1031
## 12 casual 12              42460            1461.             975
## 13 member 01             208197             659.             503
## 14 member 02             162805             682.             515
## 15 member 03             250173             783.             586
## 16 member 04             463227             814.             614
## 17 member 05             645176             862.             654
## 18 member 06             853900             878.             680
## 19 member 07             662016             912.             716
## 20 member 08             729377             877.             683
## 21 member 09             653365             823.             634
## 22 member 10             524629             746.             570
## 23 member 11             314192             721.             541
## 24 member 12             231522             679.             515
all_trips_v3 %>%
  drop_na(member_casual, day_of_week) %>%
  group_by(member_casual, day_of_week) %>%  
  summarise(number_of_rides = n()   
            , average_duration = mean(ride_length), median_duration = median(ride_length)) %>%
  arrange(member_casual, day_of_week) %>%
  rename(User = member_casual, Day = day_of_week)
## # A tibble: 14 x 5
## # Groups:   User [2]
##    User   Day   number_of_rides average_duration median_duration
##    <chr>  <ord>           <int>            <dbl>           <dbl>
##  1 casual Sun            573229            2159.            1482
##  2 casual Mon            322066            1946.            1315
##  3 casual Tue            304421            1810.            1196
##  4 casual Wed            315910            1764.            1163
##  5 casual Thu            327051            1778.            1173
##  6 casual Fri            414007            1851.            1237
##  7 casual Sat            681171            2094.            1440
##  8 member Sun            624172             935.             708
##  9 member Mon            831142             786.             602
## 10 member Tue            896323             784.             604
## 11 member Wed            911268             787.             608
## 12 member Thu            875244             787.             606
## 13 member Fri            854484             798.             612
## 14 member Sat            705946             928.             707

Looking at ride length for monthly figures, in the first graph below, average duration tends to increase as the weather gets nicer (with the exception of June for casual riders), with a low in January and a peak in July, and decreases as the weather cools. This is sensible considering the geographic location of Chicago.

all_trips_v3 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%  
  summarise(average_duration = mean(ride_length)) %>%
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  ggtitle("Monthly Average Duration: Member vs Casual")

When looking at the same monthly data aggregated on a yearly basis in the second graph, I observed some exceptions. For casual riders, June durations were always less than May, and May is similar to or higher than July. There were small spikes in December 2019 and November 2020 when that month was higher than the previous month. Interestingly, I think we can see the effects of the electric bikes, as the average duration times are lower for 2021 in all months compared to the previous two years. For members, the average duration is much more consistent in rising and falling with the seasons. Average ride times are significantly higher (by around 5 minutes) in the warmer months in 2020, during the first year of the COVID crisis, with average times for 2021 being comparable to 2019. It is important to note that the average durations for 2021 are actually quite flat from February to June, suggesting that maybe members are consistently using electric bikes for commuting or fixed trips, thus leveling out the average ride length.

Comparing casual riders to members, the average ride length is twice as long or more. Especially this was the case for 2019, where there were months where the average ride was closer to three times longer. However, after the introduction of electric bikes, the average duration for casual riders on the whole decreased compared to previous years, and in the winter months, were less than twice as long as rides by members.

all_trips_v3 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%  
  summarise(average_duration = mean(ride_length)) %>%
  ggplot(aes(x = month, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~year) +
  ggtitle("Monthly Average Duration by Year: Member vs Casual")

Looking at ride length for days of the week, I saw consistent U-shapes for both member and casual riders with higher durations on Saturdays and Sundays as seen in the first graph here.

all_trips_v3 %>%
  drop_na(member_casual, day_of_week) %>%
  group_by(member_casual, day_of_week) %>%  
  summarise(average_duration = mean(ride_length)) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  ggtitle("Day of Week Average Duration: Member vs Casual")

Jumping to the yearly aggregated data, ride length for weekdays by members were almost uniformly flat across all time periods (with a slight negative variation in 2021), suggesting that members regularly use the bikes for fixed periods of time, i.e. commuting. Durations for casual members showed a bit more of a sharper U-shape, with sharper decreases and increases than members, bottoming out on Wednesday or Thursday. 2021 was again interesting for casual riders showing a steady decrease from a maximum on Sunday to a minimum duration on Thursday, with all average durations for weekdays being less than the previous two year. The highest ride length across the three years were in 2019 for casual riders. On the other hand, 2020 had the highest ride length overall for members, but only ever so slightly, with average durations being pretty much equal across the three years. I believe these trends are the result of electric bikes being introduced and used more by casual riders, and the consistent use of the bike share program by members for commuting.

Comparing the two directly, average duration tends to be twice as long or more for casual riders, with the difference getting smaller each year, as the average ride length for casual users decreased across the board over the three year, while it reminded pretty much the same for members.

all_trips_v3 %>%
  drop_na(member_casual, year, day_of_week) %>%
  group_by(member_casual, year, day_of_week) %>%  
  summarise(average_duration = mean(ride_length)) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~year) +
  ggtitle("Day of Week Average Duration by Year: Member vs Casual")

The charts for the number of rides taken by members versus casual users tell a different story. At first glance, the graph below shows that total rides by members is always higher than by casual riders, ranging from around 5 to 9 times as many in the winter months (also see the chart above at the beginning of this section, ‘The New Data Set: Part 2’), to only about 1.5 as many in the summer months, with the remaining months being at about 2 times as many. This indicates a severe gap in the proportion of casual riders to members who continue to take trips when the weather is cold. As expected, the ridership numbers for both increase and decrease with the changing seasons. It is important to note that only the second quarter (April/May/June) include ride counts for all three years, with the third and fourth quarters including the years 2019 and 2020, while the first quarter only includes 2020 and 2021. This is why I focused on the proportional difference between member and casual riders.

all_trips_v3 %>%
  drop_na(member_casual, month) %>%
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  ggtitle("Monthly Number of Rides: Member vs Casual")

It is more telling to look in detail at the number of rides aggregated by year as seen in the graphs and chart below. I discovered some very interesting observations as each year had very different numbers. In 2019, there was a stark difference between the number of rides taken by casual users and members, with the number being at least twice as large or much more across the board (as noted above in the non-aggregated data). The gap is at its largest in January 2020, but things changed dramatically from May 2020 onward. May is the first month where the gap is quite small with the proportion of casual riders making up almost 3/4 of the ridership numbers for members (or just 1.32 times larger), and the difference continues to narrow from there. In July, the numbers are almost equal. The difference starts to widen again, and only dips below 1/2 from December 2020 to February 2021. Again in 2021, the proportion of casual riders to members increases as the weather gets warmer, and, for the final month in the data, June 2021, the number of casual riders actually exceeds the number of members. This could indicate that members became casual riders as a result of the uncertainty of COVID, or that more people became interested in the bike share program as a result of changes in working habits due to COVID. I think this avenue of though would be interesting to explore in more detail.

all_trips_v3 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%  
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~year) +
  ggtitle("Monthly Number of Rides by Year: Member vs Casual")

all_trips_v3 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%  
  summarise(number_of_rides = n()) %>%
  arrange(member_casual, year, month) %>%
  rename(User = member_casual) %>%
  print(n = Inf)
## # A tibble: 54 x 4
## # Groups:   User, year [6]
##    User   year  month number_of_rides
##    <chr>  <chr> <chr>           <int>
##  1 casual 2019  04              47416
##  2 casual 2019  05              80959
##  3 casual 2019  06             129166
##  4 casual 2019  07             174139
##  5 casual 2019  08             185212
##  6 casual 2019  09             128041
##  7 casual 2019  10              70246
##  8 casual 2019  11              18367
##  9 casual 2019  12              16140
## 10 casual 2020  01               7531
## 11 casual 2020  02              12029
## 12 casual 2020  03              24250
## 13 casual 2020  04              23296
## 14 casual 2020  05              85499
## 15 casual 2020  06             152202
## 16 casual 2020  07             263979
## 17 casual 2020  08             282281
## 18 casual 2020  09             219743
## 19 casual 2020  10             130518
## 20 casual 2020  11              78137
## 21 casual 2020  12              26320
## 22 casual 2021  01              15819
## 23 casual 2021  02               9070
## 24 casual 2021  03              78260
## 25 casual 2021  04             125172
## 26 casual 2021  05             229150
## 27 casual 2021  06             324913
## 28 member 2019  04             217407
## 29 member 2019  05             285599
## 30 member 2019  06             344846
## 31 member 2019  07             381374
## 32 member 2019  08             403036
## 33 member 2019  09             363845
## 34 member 2019  10             300581
## 35 member 2019  11             158308
## 36 member 2019  12             138565
## 37 member 2020  01             136011
## 38 member 2020  02             126634
## 39 member 2020  03             115493
## 40 member 2020  04              61019
## 41 member 2020  05             113114
## 42 member 2020  06             187740
## 43 member 2020  07             280642
## 44 member 2020  08             326341
## 45 member 2020  09             289520
## 46 member 2020  10             224048
## 47 member 2020  11             155884
## 48 member 2020  12              92957
## 49 member 2021  01              72186
## 50 member 2021  02              36171
## 51 member 2021  03             134680
## 52 member 2021  04             184801
## 53 member 2021  05             246463
## 54 member 2021  06             321314

For number of rides aggregated by day of the week, the overall trend for casual riders is U-shaped, with much larger total rides on the weekends, and slightly higher ride numbers on Fridays versus the other weekdays. On the other hand, the overall trend for members is the opposite. The number of rides is at the lowest on Sundays, increases and peaks on Wednesdays, and decreases again down to Saturdays. Comparing members to casual users, total rides are around 3 times as many Mondays to Thursdays, about 2 times as many on Fridays; however, the weekend shows the number of rides to be almost equal, with members having a slight edge. Numbers for Saturdays for both types of riders is a little larger than Sundays.

all_trips_v3 %>%
  drop_na(member_casual, day_of_week) %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  ggtitle("Day of Week Number of Rides: Member vs Casual")

Examining the data by year again tells a more complete story. 2019 follows the overall trend across all years for both types of riders, except it has a more prominent drop in the number of rides on the weekends for members. 2020 shows some subtle shifting trends. For members, the overall trend is almost the same, except the drop in weekend ridership is not prominent; that is because the overall number of rides on weekdays dropped dramatically compared to 2019, while the number of rides on Saturdays and Sundays increased (with Saturday being even higher than Monday). Interestingly, the trend for casual riders again matched 2019 and the overall trend; however, the total number of rides across both weekdays and the weekend increased, with the number of Saturday rides equaling the number of members’ Saturday rides. It is important to note that the numbers for 2019 are missing the first quarter, the winter months, so that would have a slight impact on making the overall numbers for 2019 a little bigger, but would probably not impact the overall trend too much.

The year 2021 only includes 6 months of data, but from the initial numbers it looks like total numbers are on track to outpace 2020, as also indicated in the monthly data. Again, the trend for casual users follows the same course, with the only exception being a slight dip in numbers on Thursdays. Members’ numbers show the same with a dip on Thursdays. It is more important to notice that the gap between casual riders and members on weekdays has narrowed year on year, while for both Saturday and Sunday, the gap has reversed: casual riders outnumber members.

all_trips_v3 %>%
  drop_na(member_casual, year, day_of_week) %>%
  group_by(member_casual, year, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  facet_wrap(~year) +
  ggtitle("Day of Week Number of Rides by Year: Member vs Casual") 

Looking at the total ride numbers in proportion of members and casual riders is also very telling. The first chart shows the monthly data where the sharpest decrease year-wise between 2019 and 2020 being at the beginning of the COVID crisis, namely April, May, and June. Tellingly, the numbers jumped in July 2020, with the totals for July to November almost equaling or exceeding 2019. As noted earlier, starting from May 2020 the proportion of member to casual riders approached parity at 50-50 until the winter which saw decreased numbers to the previous year, and similar proportions. Starting in March 2021, there was a surge in ridership each month, with increased numbers compared to both previous years. This overall increase was accounted for by the change in the number of casual users’ rides which really skyrocketed compared to both 2019 and 2020, culminating with the number of casual riders surpassing members in June 2020. At the same time, the number of member rides in March, April, May and June were comparable to May, June, July, and August of 2020; that means the numbers for earlier months in 2021 were reaching the peak ride season numbers of 2020. Those stats for April to June were almost a return to 2019 numbers for members, indicating a recovery to pre-COVID numbers. Combined with the huge increase in casual users, the overall trend in 2021 starting from March suggests that 2021 will have higher numbers than the two previous years, at least for the warmer months.

all_trips_v3 %>%
  drop_na(member_casual, year, month) %>%
  group_by(member_casual, year, month) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~year) +
  ggtitle("Monthly Total Number of Rides by Year: Member vs Casual")

Taking a look at the yearly stats for days of the week, I again want start off by noting that 2019 numbers do not include the first three months of 2019, and thus the overall numbers would be a little more than shown. But given the data from the other years, it would likely make the proportions between member and casual users even less, while at the same time, only slightly increasing overall numbers. Because the proportion of member riders outweighed casual riders in 2019, as seen in the monthly data, the trend for 2019 matches the overall member trend (as seen above) with the lowest numbers on Sundays, almost uniform numbers for weekdays, and Saturday numbers falling in between. 2020, however, shows a different pattern because the number of casual riders increased. The trend resembles the overall trend for casual riders: a maximum ridership on Saturdays, a minimum on Mondays with weekday numbers gradually increasing to Fridays, which is equal to Sundays. Comparing 2019 to 2020 directly, the data seems to indicate (even if it is lacking 3 months in 2019) that the overall numbers in 2020 were higher on the weekends, while overall numbers on weekdays were lower. The trend for 2021 follows the same pattern as 2020 except with a flatter numbers across the weekdays (with a dip on Thursdays as previously noted). As discussed in the monthly data, with numbers for earlier months in 2021 matching peak season in 2020, the data is telling us that overall numbers for 2021 should surpass 2020 and 2019.

all_trips_v3 %>%
  drop_na(member_casual, year, day_of_week) %>%
  group_by(member_casual, year, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~year) +
  ggtitle("Day of Week Total Number of Rides by Year: Member vs Casual")

These two trend shifts: a more equal proportion of casual riders to members on the monthly level, and change to higher ridership numbers on the weekends on the day of the week level, may indicate a change in attitude by the public as a result of COVID. I believe that starting in April 2020, members were probably commuting less because of a shift in working habits away from the office due to lock downs and social distancing. It would be insightful to know if the annual membership renewals also went down in this period and I would inquire and see if I could get more information in a real-life situation, since the numbers indicate this could be the case. If that is so, then members might not have renewed their annual membership and only use the bikes for commuting when they needed to go to the office to check in during the height of the COVID pandemic. Since member numbers recovered in 2021, it suggests that renewals are back up to pre-pandemic levels or the bike share program is seeing an increase in new annual members. The increase in overall ridership as a result of the increase in casual riders may be due the effect of electronic bikes. Again, if people only have to go to the office occasionally, new riders may use the bikes for that purpose during the week. I think the increase in ridership on weekends is a result of “isolation fever”; people want to get out of the house and enjoy their time. With the risk serious infection in an outdoor environment seeming to be lower than thought at the beginning of the pandemic, peopel want to get out and enjoy the nice weather in Chicago … we have to focus on this demographic!!!!!!

To investigate this in more detail, I decided to aggregate the data by bike type to see if the introduction of electronic bikes was having an impact on the numbers, as I suspected they were. I first looked at the charts to see the numbers in aggregated detail

all_trips_v3 %>%
    drop_na(year, month, rideable_type) %>%
    group_by(year, month, rideable_type) %>%
    summarise(number_of_rides = n()
              , average_duration = mean(ride_length)) %>%
    arrange(rideable_type, year, month) %>%
    print(n = Inf)
## # A tibble: 46 x 5
## # Groups:   year, month [27]
##    year  month rideable_type number_of_rides average_duration
##    <chr> <chr> <chr>                   <int>            <dbl>
##  1 2020  12    classic_bike            70511             841.
##  2 2021  01    classic_bike            61559             793.
##  3 2021  02    classic_bike            34746             969.
##  4 2021  03    classic_bike           152219            1064.
##  5 2021  04    classic_bike           214041            1095.
##  6 2021  05    classic_bike           308064            1182.
##  7 2021  06    classic_bike           433358            1144.
##  8 2019  04    docked_bike            264823            1007.
##  9 2019  05    docked_bike            366558            1110.
## 10 2019  06    docked_bike            474012            1202.
## 11 2019  07    docked_bike            555513            1285.
## 12 2019  08    docked_bike            588248            1255.
## 13 2019  09    docked_bike            491886            1111.
## 14 2019  10    docked_bike            370827             923.
## 15 2019  11    docked_bike            176675             738.
## 16 2019  12    docked_bike            154705             754.
## 17 2020  01    docked_bike            143542             670.
## 18 2020  02    docked_bike            138663             737.
## 19 2020  03    docked_bike            139743             984.
## 20 2020  04    docked_bike             84315            1320.
## 21 2020  05    docked_bike            198613            1632.
## 22 2020  06    docked_bike            339942            1565.
## 23 2020  07    docked_bike            542842            1632.
## 24 2020  08    docked_bike            550074            1466.
## 25 2020  09    docked_bike            400642            1314.
## 26 2020  10    docked_bike            233633            1137.
## 27 2020  11    docked_bike            150242            1138.
## 28 2020  12    docked_bike             12651            1166.
## 29 2021  01    docked_bike              2057            1934.
## 30 2021  02    docked_bike              1218            2205.
## 31 2021  03    docked_bike             15309            2512.
## 32 2021  04    docked_bike             24162            2633.
## 33 2021  05    docked_bike             42217            2662.
## 34 2021  06    docked_bike             50269            2595.
## 35 2020  07    electric_bike            1779            1188.
## 36 2020  08    electric_bike           58548            1258.
## 37 2020  09    electric_bike          108621            1107.
## 38 2020  10    electric_bike          120933             941.
## 39 2020  11    electric_bike           83779             944.
## 40 2020  12    electric_bike           36115             788.
## 41 2021  01    electric_bike           24389             734.
## 42 2021  02    electric_bike            9277             869.
## 43 2021  03    electric_bike           45412             984.
## 44 2021  04    electric_bike           71770            1029.
## 45 2021  05    electric_bike          125332            1103.
## 46 2021  06    electric_bike          162600            1077.
# all_trips_v3 %>%
#     drop_na(member_casual, year, month, rideable_type) %>%
#     group_by(member_casual, year, month, rideable_type) %>%  
#     summarise(number_of_rides = n()   
#               , average_duration = mean(ride_length)) %>%
#     arrange(member_casual, rideable_type, year, month) %>%
#     rename(User = member_casual) %>%
#     print(n = Inf)

Taking a look at the same monthly data for overall numbers by rideable_type in the charts above and below, we see a measurable effect on ridership numbers in August 2020 compared to 2019, the second month of ebikes. The chart above shows that July only had 1779 ebike rides the first month they were introduced, but increased drastically to 58548 the next month. For September and October in the graph below, the numbers between 2019 and 2020 are basically the same, while November shows an impact with increased ridership compared to 2019. With the exception of July where numbers recovered after the initial shock of COVID from April to June with minimal influence of ebikes, it seems that ebikes meaningfully impacted overall ridership for 2020 numbers and the effects can be seen more in 2021.

In the winter months, the overall numbers decreased relative to 2019, but interestingly, the proportion of pedal to ebikes decreased, indicating that winter users were more likely to chose an ebike. As the weather warms in 2021, the proportion of pedal to ebikes increased again, but at the same time, the number of ebikes in use also increased a lot, with numbers of rides by ebikes in April 2021 almost matching August 2020. In May and June, ebike rides were higher than the peak month of October 2020, surpassing 120K rides.

Looking at the pedal bike numbers (as discussed earlier, pedal bikes are both classic and docked type) for April to June 2021, the numbers match the pedal bike rides from 2019; thus, clearly showing the ebikes are helping to propel the 2021 overall ride numbers to new heights.

The effect of electronic bikes on the overall numbers is obvious. Honing in on the proportion of pedal bikes (to remind the reader: pedal bikes are classic and docked types) in April, May, and June of 2021, these measures are a little less than the ridership numbers in 2019 for those same months when there were only docked (pedal) type bikes. Thus, this graph shows that the introduction of electronic bikes boosted the overall ride numbers in the Bike Sharing Program, with June 2021 having the highest ridership numbers across all months and years.

all_trips_v3 %>%
    drop_na(year, rideable_type) %>%
    group_by(year, rideable_type) %>%
    summarise(number_of_rides = n()
              , average_duration = mean(ride_length)) %>%
    arrange(rideable_type, year) %>%
    print(n = Inf)
## # A tibble: 7 x 4
## # Groups:   year [3]
##   year  rideable_type number_of_rides average_duration
##   <chr> <chr>                   <int>            <dbl>
## 1 2020  classic_bike            70511             841.
## 2 2021  classic_bike          1203987            1112.
## 3 2019  docked_bike           3443247            1113.
## 4 2020  docked_bike           2934902            1354.
## 5 2021  docked_bike            135232            2600.
## 6 2020  electric_bike          409775            1019.
## 7 2021  electric_bike          438780            1044.
# all_trips_v3 %>%
#     drop_na(member_casual, rideable_type) %>%
#     group_by(member_casual, rideable_type) %>%  
#     summarise(number_of_rides = n()   
#               , average_duration = mean(ride_length)) %>%
#     print(n = Inf)

# all_trips_v3 %>%
#     drop_na(year, month, rideable_type) %>%
#     group_by(year, month, rideable_type) %>%  
#     summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
#     print(n = Inf)

all_trips_v3 %>%
    drop_na(member_casual, year, month, rideable_type) %>%
    group_by(member_casual, year, month, rideable_type) %>%  
    summarise(number_of_rides = n()) %>%
    ggplot(aes(x = month, y = number_of_rides, fill = rideable_type)) +
    geom_bar(stat = "identity") + 
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(~year)

When examining the day of week data for bike type, I aggregated by user type and year. Here, we see that December 2020 was indeed the last year that the docked type designation was used for members. The first interesting thing to note here is that for members, weekend use of bikes increased dramatically in 2020 and 2021, compared to 2019, with little influence from the type of bike. Again, the decrease in weekday ridership and the increase in weekend use could be attributed to COVID. The second interesting thing to note is the pretty even use of ebikes by both members and casual users across years and days of the week. Casual users have slightly increased use of ebikes on weekends, but nothing compared to the bigger difference in weekend use of pedal bikes by casual riders. Weekday use of ebikes is marginally higher for members than on weekend, and is negligible at best. One reason for this consistent use could be a limited number of ebikes. An investigation into the total number of bikes of all types would be needed to better understand this.

all_trips_v3 %>%
    drop_na(member_casual, year, day_of_week, rideable_type) %>%
    group_by(member_casual, year, day_of_week, rideable_type) %>%  
    summarise(number_of_rides = n()) %>%
    ggplot(aes(x = day_of_week, y = number_of_rides, fill = rideable_type)) +
    geom_bar(stat = "identity") + 
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(~member_casual + year)

I believe the increase in ridership was the result of COVID. Members were probably commuting less because of a shift in working habits away from the office. Some members may only need to use a bike to go to work once in a while, and thus may have shifted to casual use. While at the same time, with the shift in workplace habit in addition to ebikes, new users may have been trying the system and using the bikes when needed, or taking advantage of the system to get some exercise and a break from COVID isolation. T he data also indicates that even with 3 hour ride allowance for day pass users, there is little evidence that casual riders take advantage of this. There is higher ridership numbers by casual riders on weekends, which may indicate more day pass users, but the average duration times do time support any large influx of day pass riders, at least non who take long rides. Most riders keep the trips short, regardless of the type of user.

As stated in my business objectives, I wanted to see how casual riders and members differ when looking at trip lengths at different benchmarks.

Using the the same data set of rides under 3 hours and 35 minutes (12, 900 seconds), I started off my analysis here by calculating quantile figures for both types of users, then for the data filtered by members and casual rider separately. The numbers found below indicate that 99% of members’ ride lengths are under 51 minutes; while on the other hand, 99% of casual riders’ ride lengths are under 161 minutes; more than three times longer. At both the 25% and 50% quantile, members’ ride lengths are around half of casual riders’ ride lengths. This indicates that the riders who take longer trips are more likely to be casual riders.

"Quantile Values for All Users for Ride Length under 3 hrs 35 mins"
## [1] "Quantile Values for All Users for Ride Length under 3 hrs 35 mins"
quantile(all_trips_v3$ride_length, c(.5, .75, .9, .95, .99), na.rm=TRUE)
##  50%  75%  90%  95%  99% 
##  786 1436 2431 3683 7538
"Quantile Values for Members"
## [1] "Quantile Values for Members"
quantile(subset(all_trips_v3, member_casual == "member")$ride_length, c(.25, .5, .75, .9, .95, .99), na.rm=TRUE)
##  25%  50%  75%  90%  95%  99% 
##  373  627 1064 1658 2076 3020
"Quantile Values for Casual Users"
## [1] "Quantile Values for Casual Users"
quantile(subset(all_trips_v3, member_casual == "casual")$ride_length, c(.25, .5, .75, .9, .95, .99), na.rm=TRUE)
##  25%  50%  75%  90%  95%  99% 
##  730 1319 2383 4423 6113 9661

Based on the quantile figures above, I broke down the data into different bins of different ride lengths and different filters based on the quantile figures for the data, as seen in the first chart above. The following histograms give a sense of the scale and difference between member and casual users.

In the first one, which is broken down into 2-minute rides with the data filtered to cover 95% of the data (3780 seconds), we see a peak for the number of rides of under 200,000 for casual users, with a duration peak of between 540 to 660 seconds, while members peak at under 800,000 rides with a duration peak of 420 to 540 seconds. Both types of users reach a maximum ride count at a ride length duration around 9 minutes (the mode: 7-9 mins for members and 9-11 mins for casual riders respectively). From there, the ride counts of members decline much more quickly and consistently than casual users. The rate of change between 120-second time bins decrease fairly evenly for members as ride length grows. On the other hand, the rate of change for casual users is more steady and less consistent. The single biggest decline in the casual graph occurs at the mark between 1860 to 1980 seconds, or just after the 30 minute mark, which is not surprising given the ride conditions as stated at the beginning of this report (see section ‘About the Business Task’). Comparing the two histograms, the count of rides for members is greater than casual users up until around the 2000 second second mark. After that, the ride count for casual users is greater, but not to the same magnitude that member counts were before that turning point.

# 3780 seconds covers around 95% of the data
all_trips_v3 %>%
  filter(ride_length < 3780) %>%
  ggplot( aes(x=ride_length)) +
  geom_histogram(binwidth = 120, fill = "#56B4E9") +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~member_casual) +
  labs(title = "Count of Duration under 63 minutes by 2-minute Rides: 95% of the Data", caption = "Note: the first bin includes ride length of 0 to 1 minute")

When broken into 30-minute rides and filtered to cover 99% of the data (8100 seconds), we see the first 15 minute bin entails the majority of the member rides, while the 15-45 minute bin entails the majority of casual rides, indicating that casual riders tend to take slightly longer trips. The second bin also shows an almost equal number of trips by member and casual riders. These first two bins are dominated by member users, and overwhelmingly show that the majority of members and casual users take trips under 45 minutes. In fact, the total ride numbers are close to 8M and are indicative of the how most users use the bike sharing system. For rides under 15 minutes, it is more likely to be from a member, while for rides between 15 and 45 minutes, the likelihood is closer to 50-50. The remaining bins ((in mins) 45-75, 75-105, 105-135) are all considerably smaller and the majority of rides in these bins are from casual users. This indicates that if we were to take a longer trip over 45 minutes, it is more likely to be from a casual rider.

# Up to 8100 seconds covers around 99% of data
all_trips_v3 %>%
  filter(ride_length < 8100) %>%
  ggplot( aes(x=ride_length, fill = member_casual)) +
  scale_y_continuous(labels = scales::comma) +
  geom_histogram(binwidth = 1800) +
  labs( title = "Count of Total Duration under 135 Minutes (2 hours 15 minutes), by 30-minute Rides: 99% of Data", caption = "Note: the first bin includes ride length 0 to 15 minutes")

The final chart shows the above chart without the first two bins (the first bin is 45 mins to 75 mins, the second is 75 to 105 mins, the third bin is 105 to 135 mins- the three smaller bins from above) and includes the ride counts larger than 2 hours and 15 mins (bins (in mins): 135-165, 165-195, 195-215; the last bin only spanning only 20 minutes as the data includes up to 3 hrs and 35 mins). In all of the 30-minute ride length groupings, casual riders clearly outweigh members proportionally with a magnitude 6 times greater or more. This chart reinforces the fact that the vast majority of longer trips, in this case longer meaning over 45 mins, are taken by casual riders.

Even so, the total number of trips over 45 mins are on a much smaller magnitude when compared to trips under 45 mins. For example, looking at the chart above and below, a 30-minute ride span of 15 to 45 mins includes around 3M trips while the next 30-minute ride span (45 to 75 mins) has just over 400K. That is around a 90% drop in ride numbers. And, the number of rides in the first group, only a 15-minute ride length, includes close to 5M rides. The only conclusion is that most riders use the bike share system for shorter trips.

all_trips_v3 %>%
  filter(ride_length >= 2700) %>%
  ggplot( aes(x=ride_length, fill = member_casual)) +
  geom_histogram(binwidth = 1800) +
  scale_y_continuous(labels = scales::comma) +
  labs( title = "Count of Total Duration 45 Minutes to 3 hour 30 minutes, by 30-minute Rides", caption = "Note: the first bin includes ride length 45 to 75 minutes")

# all_trips_v3 %>%
#   filter(ride_length < 4140) %>%
#   ggplot( aes(x=ride_length, fill = member_casual)) +
#   geom_histogram(binwidth = 120) +
#   scale_y_continuous(labels = scales::comma)


# all_trips_v3 %>%
#   filter(ride_length >= 1800) %>%
#   ggplot( aes(x=ride_length, fill = member_casual)) +
#   geom_histogram(binwidth = 1800) +
#   scale_y_continuous(labels = scales::comma) +
#   labs( title = "Count of Total Duration 30 Minutes to 3 hour 30 minutes, by 30-minute Rides", caption = "Note: the first bin includes ride length 30 to 45 minutes")
# all_trips_v3$year_month <- format(as.Date(all_trips_v3$date), "%Y/%m")


# __________________________________________________
# all_trips_v3 %>%
#   drop_na(member_casual, rideable_type) %>%
#   group_by(member_casual, rideable_type) %>%
#   summarise(number_of_rides = n()) %>%
#   arrange(member_casual, rideable_type)
#   # print(n = Inf)
# # 
# all_trips_v3 %>%
#   drop_na(year_month, member_casual, rideable_type) %>%
#   group_by(year_month, member_casual, rideable_type) %>%
#   summarise(number_of_rides = n()) %>%
#   ggplot(aes(x = year_month, y = number_of_rides, group = rideable_type, colour = rideable_type)) +
#   geom_line() +
#   facet_wrap(~member_casual) +
#   theme(axis.ticks = element_blank(), axis.text.x = element_blank())
# # 
# all_trips_v3 %>%
#   drop_na(year, month, rideable_type) %>%
#   group_by(year, month, rideable_type) %>%
#   summarise(number_of_rides = n()) %>%
#   ggplot() +
#   geom_bar(aes(x = month, y = number_of_rides), stat = "identity", fill="#CCFFFF", width=0.5) +
#   #geom_point(aes(x = month, y = number_of_rides, group=rideable_type, colour = rideable_type)) +
#   geom_line(aes(x = month, y = number_of_rides, group=rideable_type, colour = rideable_type)) +
#   scale_colour_manual(values=c("#0072B2","#009E73", "#D55E00")) +
#   #scale_fill_hue(c=45,l=80) +
#   facet_wrap(~year)